Taking Turing Seriously (but Not Literally)
Abstract
Recent results from present-day instantiations of the Turing test, most notably the annual Loebner Prize competition, have fueled the perception that the test has either already been passed or that it soon will be. With this perception comes the implication that computers are on the verge of achieving human-level cognition. We argue that this perspective on the Turing test is flawed because it lacks an essential component: that of unbounded creativity on the part of the questioner, which in turn induces a cooperative responsibility on the part of the respondent. We discuss how the decades-spanning program of activity at Indiana University’s Center for Research on Concepts and Cognition represents one (perhaps unique) approach to taking Turing seriously in this respect.

1 | Introduction: The Turing Test in Letter and Spirit

If Turing were alive today, what would he think about the Turing test? Would he still consider his “imitation game” an effective means of gauging intelligence, given what we now know about the Eliza effect, chatbots, and the increasing vacuity of interpersonal communication in the age of texting and instant messaging? Alas, one can only speculate, but we suspect he would find the prevailing interpretation of his classic “Computing Machinery and Intelligence” (1950) to be disappointingly literal-minded. Current instantiations of his eponymous test, most notably the annual Loebner Prize competition, adhere closely to the outward form—or letter—of the imitation game he proposed. However, it is questionable how faithful such competitions are to the underlying purpose—or spirit—of the game. That purpose, lest we forget, is to assess whether a given program demonstrates intelligence1 -- and if so, to what extent. But this purpose gets obscured when the emphasis turns from modeling intelligence to simply “beating the test”.

The question, then, is whether we might better capture the spirit of Turing’s test through other, less literal-minded means. Our answer is not only that we can, but that we must, at least if the Turing test is going to remain a useful standard-bearer for assessing progress in AI. The alternative is to risk trivializing the Turing test by equating “intelligence” with the ability to mimic the sort of context-neutral conversation that has increasingly come to pass for “communication.” This perspective favors methodologies such as statistical machine learning, which we claim are better suited to modeling human banality than human intelligence. Indeed, it is our belief that such methodologies will ultimately fail even at this humbler goal. Our essential claim can be summarized as follows: “Unless we can work out how to build a genius, we won’t even be able to build an idiot.” The motivating perspective for this statement is that cognitive mechanisms such as metacognition and domain-assisted perception are vital for generating the utterances of both the insightful and the ignorant, but the nature of these mechanisms is thrown into sharp relief by the subtle and economical insights afforded to “genius.”

1 Note that we are content to restrict our concern to a test for humanocentric intelligence; see French (1990) for a discussion of this issue.

2 | Intelligence, Trickery, and the Loebner Prize

Does the ability to deceive others presuppose intelligence, or merely a knack for deception? In proposing his imitation game, Turing wagered that the two were inseparable. However, as Shieber (1994) observes, “[I]t has been known since Weizenbaum’s surprising experiences with ELIZA that a test based on fooling people is confoundingly simple to pass” (p. 72; cf. Weizenbaum 1976).
The gist of Weizenbaum’s realization is that our interactions with computer programs often tell us less about the inner workings of the programs themselves than they do about our tendency to project meaning and intention onto artifacts, even when we should know better.

A Parallel Case: Art Forgery

For another perspective on the distinction between genuine accomplishment and mere trickery, let us consider the parallel case of art forgery. Is it possible to draw a distinction between a genuine artist, on the one hand, and a mere faker, on the other? It is tempting to reply that in order to be a good faker—one good enough to “fool the experts”—one must necessarily be a good artist to begin with. But this sort of argument is too simplistic, as it equates artistic quality with technical skill and prowess, meanwhile devaluing the role of originality, artistic vision, and other qualities that we typically associate with genuine artistry (cf. Lessing 1965; Dutton 1979). In particular, the ability of a skilled art forger to create a series of works in the style of, say, Matisse does not imply insight into the underlying artistic or expressive vision of Matisse. In other words, “There is all the difference in the world between a painting that genuinely reveals qualities of mind to us and one which blindly apes their outward show” (Kieran 2005, p. 21).

Russell’s famous quote about specification equating to theft helps us relate an AI methodology to the artistry–forgery distinction. Russell’s statement can be paraphrased as follows: merely saying that there exists a function (e.g., sqrt()) with some property (e.g., sqrt(x)*sqrt(x) = x for all x >= 0) does not tell you very much about how to generate the actual sqrt() function. Similarly, the ability to reproduce a small number of values of x that meet this specification does not imply insight into the underlying mechanisms of which the existence of these specific values is essentially a side effect. A key issue here is the small number of values: since contemporary versions of the Turing test are generally highly time-constrained, it is all the more imperative that the test involve a deep probe into the possible behaviors of the respondent.

Many of the Loebner Prize entrants have adopted the methodologies of corpus linguistics and machine learning (ML), so we will re-frame the issue of thematic variability in these terms. We might abstractly consider the statistical ML approach to the Turing test as being concerned with the induction of a generative grammar. The ability to induce an algorithm that reproduces some themed collection of original works does not in itself imply that any underlying sensibilities that motivated those works can be effectively approximated by that algorithm. One way of measuring the “work capacity” of an algorithm is to employ the Kolmogorov complexity measure (Solomonoff 1964), which is essentially the size of the shortest functionally identical algorithm. In the induction case, algorithms with the lowest Kolmogorov complexity will tend to be those that exhibit very little variability—in the limiting case, generating only instances from the original collection. (This would be analogous to a forger who could only produce exact copies of another artist’s works, rather than works “in the style of” said artist.)
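To make the sqrt() illustration concrete, the following toy sketch (our own, with hypothetical function names) contrasts a “forger” that has memorized correct outputs for a handful of anticipated probe values with a mechanism that embodies the underlying relationship. A shallow, time-constrained probe cannot tell the two apart, whereas an open-ended probe exposes the forger immediately:

```python
import random

def satisfies_spec(f, x, tol=1e-6):
    """Does f behave like a square root at the single probe point x?"""
    return abs(f(x) * f(x) - x) < tol

# The "forger": a lookup table of memorized answers for anticipated probes.
MEMORIZED = {1.0: 1.0, 4.0: 2.0, 9.0: 3.0, 16.0: 4.0}

def fake_sqrt(x):
    return MEMORIZED.get(x, 1.0)  # falls apart away from the memorized points

# The "genuine" mechanism: Newton's method, which embodies the relationship
# of which the memorized values are merely side effects.
def real_sqrt(x, iterations=60):
    guess = x / 2.0 if x > 1.0 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess

shallow_probe = [1.0, 4.0, 9.0]                               # brief, predictable test
deep_probe = [random.uniform(0.5, 100.0) for _ in range(10)]  # open-ended test

print(all(satisfies_spec(fake_sqrt, x) for x in shallow_probe))  # True
print(all(satisfies_spec(real_sqrt, x) for x in shallow_probe))  # True
print(all(satisfies_spec(fake_sqrt, x) for x in deep_probe))     # almost surely False
print(all(satisfies_spec(real_sqrt, x) for x in deep_probe))     # True
```

The analogy is loose, of course, but it captures why a brief, predictable interrogation rewards memorization over mechanism.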
In contrast, programs from the Fluid Analogies family of architectures possess domain-specific relational and generative models. For example, the Letter Spirit architecture (Rehling 2001) is specifically concerned with exploring the thematic variability of a given font style. Given Letter Spirit’s sophisticated representation of the “basis elements” and “recombination mechanisms” of form, it might reasonably be expected to have high Kolmogorov complexity. The thematic variations generated by Letter Spirit are therefore not easily approximated by domain-agnostic data-mining approaches.

The artistry–forgery distinction is useful in so far as it offers another perspective on the issue of depth versus shallowness—an issue that is crucial in any analysis of the Turing test. Just as the skilled art forger is adept at using trickery to simulate authenticity—for example, by artificially “aging” a painting through various techniques such as baking or varnishing (Werness 1983)—so similar forms of trickery often find their way into the Loebner Prize competition: timely pop-culture references, intentional “typos” and misspellings, and so on (cf. Shieber 1994; Christian 2011). Yet these surface-level tricks have as much to do with the modeling of intelligence as coating the surface of a painting with antique varnish has to do with bona fide artistry. This essentially adversarial approach is a driving force in the divergence of current instantiations of the Turing test from the spirit of the test as it was originally conceived. It is our contention that a test that better meets the original intent should be driven by the joint aims of creativity and collaboration.

3 | Taking Turing Seriously: An Alternative Approach

In order to emphasize the role of “unbounded creativity” in the evaluation of intelligence, we describe a Feigenbaum test—essentially a “micro-Turing-test” (Feigenbaum 2003)—restricted to the microdomain of analogies between letter-strings. For example, “If abc changes to abd, how would you change pxqxrx in ‘the same way’?” (or simply abc → abd; pxqxrx → ???, to use a bit of convenient shorthand). Problems in this domain have been the subject of extensive study (Hofstadter et al. 1995), resulting in the creation of the well-known Copycat model (Mitchell 1993) and its successor, Metacat (Marshall 1999). Although apparently highly restricted, problems in this domain can nonetheless exhibit surprising subtleties. We proceed to give some examples, described concretely in terms of the mechanisms of Copycat and Metacat, which are two instantiations of what we refer to more broadly as Fluid Concepts architectures.

Copycat, Metacat, and Fluid Concepts Architectures

Copycat’s architecture consists of three main components, all of which are common to the more general Fluid Concepts architectural scheme. These components are the Workspace, which is roughly the program’s working memory; the Slipnet, a conceptual network with variably weighted links between concepts (essentially a long-term memory); and the Coderack, home to a variety of agent-like codelets, which perform specific tasks in (simulated) parallel, without the guidance of an executive controller.
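As a rough schematic of this control regime, consider the following toy sketch (our own illustration, not Copycat’s actual code; the codelet names, urgency values, and Workspace layout are hypothetical):

```python
import random

class Codelet:
    """A small, self-contained task with an urgency that biases selection."""
    def __init__(self, name, urgency, action):
        self.name = name        # e.g., "group-scout", "bridge-builder"
        self.urgency = urgency  # relative likelihood of being chosen
        self.action = action    # callable run against the Workspace

def run_coderack(coderack, workspace, steps=100):
    """Repeatedly pick one codelet, biased by urgency, and run it.
    No executive controller decides what happens next."""
    for _ in range(steps):
        if not coderack:
            break
        weights = [c.urgency for c in coderack]
        codelet = random.choices(coderack, weights=weights, k=1)[0]
        coderack.remove(codelet)
        # A codelet may post follow-up codelets, giving bottom-up control.
        coderack.extend(codelet.action(workspace) or [])
    return workspace

# Hypothetical codelet type: a scout that notices sameness pairs such as "ii".
def make_group_scout(position):
    def scout(ws):
        s = ws["target"]
        if position + 1 < len(s) and s[position] == s[position + 1]:
            ws["groups"].append(s[position:position + 2])
        return []  # a fuller sketch might post bond- or bridge-builders here
    return Codelet(f"group-scout-{position}", urgency=5, action=scout)

workspace = {"initial": "abc", "modified": "abd", "target": "iijjkk", "groups": []}
coderack = [make_group_scout(i) for i in range(len(workspace["target"]))]
print(run_coderack(coderack, workspace)["groups"])  # 'ii', 'jj', 'kk' in some order
```

The point of the sketch is only the control structure: many small agents, stochastically interleaved, with no top-level script dictating the order in which perception unfolds.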
For example, given the problem abc → abd; iijjkk → ???, these tasks would range from identifying groups (e.g., the jj in iijjkk) to proposing bridges between items in different letter-strings (e.g., the b in abc and the jj in iijjkk) to proposing rules to describe the change in the initial pair of strings (i.e., the change from abc to abd). (See Mitchell 1993 for an in-depth discussion of codelet types and functions in Copycat.)

Building on Copycat, Metacat incorporates some additional components that are not present in its predecessor’s architecture. The most notable of these are the episodic memory and the temporal trace. As the name suggests, the emphasis in Metacat is on meta-cognition, which can broadly be defined as the process of monitoring, or thinking about, one’s own thought processes. What this means for Metacat is an ability to monitor—via the temporal trace—events that take place en route to answering a given letter-string problem, such as detecting a “snag” (e.g., trying to find the successor to z, which leads to a snag because the alphabet does not “circle around” in this domain) or noticing a key idea. Metacat also keeps track of its answers to previous problems, as well as its responses on previous runs of the same problem, both via the episodic memory. As a result, it is able to be “reminded” of previous problems (and answers) based on the problem at hand, which ultimately amounts to the making of meta-analogies (or analogies between analogies). Finally, it is able to compare and contrast two answers at the user’s prompting.

Philosophically speaking, these architectures are predicated upon the conviction that it is possible to “know everything about” the Platonic entities and relationships in a given microdomain. In other words, these architectures are declarative in the sense of Jackson’s color scientist in the “Black and White Room” (Jackson 1982): there is nothing about domain entities and processes (or the effect of the latter on the former) that is not accessible to introspection. In Copycat, the entities range from permanent “atomic” elements (primarily, the 26 letters of the alphabet) to more temporary, composite ones, such as the letter strings that make up a given problem (abc, iijjkk, pxqxrx, etc.), the groups within letter strings that are perceived during the course of a run (e.g., the ii, jj, and kk in iijjkk), and the bonds that are formed between the letters in such groups. The relationships include concepts such as same, opposite, successor, predecessor, and so on. A key aspect of the Fluid Concepts architecture is its ability to explore the space of instantiations of those entities and relationships in a (largely) non-stochastic fashion—that is, in a manner that is predominantly directed by the nature of the relationships themselves. In contrast, the contextual pressures that give rise to some subtle yet low-frequency solution are unlikely to have a referent within a statistical ML model built from a corpus of Copycat answers, since outliers are not readily captured by gross mechanisms such as sequences of transition probabilities.

An example from the Copycat microdomain

To many observers, a letter-string analogy problem such as the aforementioned abc → abd; iijjkk → ??? will likely appear trivial at first glance. Yet upon closer inspection, one can come to appreciate the surprising subtleties involved in making sense of even a relatively basic problem like this one.
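Before examining the candidate answers in prose, the following toy enumeration (our own illustration; it hard-codes interpretations that Copycat must discover dynamically) shows how different readings of the change abc → abd yield different answers for iijjkk:

```python
def successor(ch):
    return chr(ord(ch) + 1)  # toy version: ignores the snag lurking at "z"

target = "iijjkk"
groups = ["ii", "jj", "kk"]  # the sameness groups in the target string

candidates = {
    # Abstract: replace the letter category of the last *group* with its successor.
    "iijjll": groups[0] + groups[1] + successor("k") * len(groups[2]),
    # Semi-abstract: change only the last *letter* to its successor.
    "iijjkl": target[:-1] + successor("k"),
    # Literal: replace the last letter with the letter "d".
    "iijjkd": target[:-1] + "d",
    # Semi-literal: replace the last group's letter category with "d".
    "iijjdd": groups[0] + groups[1] + "d" * len(groups[2]),
}
for answer, derivation in candidates.items():
    assert answer == derivation  # each reading produces the answer named
```

The interest lies not in producing these strings, which is trivial, but in modeling how one reading rather than another comes to dominate under contextual pressure; that is what the list below examines.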
Consider the following (non-exhaustive) list of potential answers:

● iijjll – Arriving at this seemingly basic answer requires at least two non-trivial insights: (1) mapping the kk sameness group in iijjkk as playing the same role that the letter c does in abc; and (2) seeing the change from c to d in terms of successorship rather than merely as a change from the letter c to the letter d. The latter point may seem trivial, but it is not a given, and as we will see, there are other possible interpretations.

● iijjkl – This uninspiring answer results from simply changing the letter category of the last letter in the string to its successor rather than changing the letter category of the last group.

● iijjkd – This answer results from the literal-minded strategy of simply changing the last letter in the string to d, all the while ignoring the other relationships among the various groups and letter categories.

● iijjdd – This semi-literal, semi-abstract answer falls somewhere in between iijjll and iijjkl on the quality scale. On the one hand, it reflects a failure to perceive the change from c to d in the initial string in terms of successorship, instead treating it as a mere replacement of the letter c with the letter d. On the other hand, it does signal a recognition that the concept group is important, as it at least involves carrying out the change from k to d in the target string over both k’s as opposed to just the rightmost one. This kind of answer arguably has a “humorous” quality to it, more so than iijjkl or iijjkd, due to its mixture of insight and confusion.

This incomplete catalog of answers hints at the range of issues that can arise in examining a single problem in the letter-string analogy domain. Copycat itself is able to come up with all of the aforementioned answers (along with a few others), as illustrated in Table 1. As the table shows, iijjll appears as the program’s “preferred choice” according to the two available measures. These measures are (1) the relative frequency with which each answer is given and (2) the average “final temperature” associated with each answer. Roughly speaking, the temperature—which can range from 0 to 100—indicates the program’s moment-to-moment “happiness” with its perception of the problem during a run, with a lower temperature corresponding to a more positive evaluation.

Table 1. Copycat’s performance over 1000 runs on the problem abc → abd; iijjkk → ???. Adapted from Mitchell (1993).

The Feigenbaum Test: From Copycat to Metacat

One limitation of Copycat is its inability to “say” anything about the answers it gives beyond what appears in its Workspace during the course of a run. While aggregate statistics such as those illustrated in Table 1 can offer some insight into its performance, the program is not amenable to genuine Feigenbaum testing, because its only means of summarizing its own viewpoint is the temperature. To the extent that it can be Feigenbaum tested at all, it can only be tested with what might be termed first-order questions (e.g., abc → abd; iijjkk → ???). It cannot answer second-order questions (i.e., questions about questions), let alone questions about its answers to questions about questions. In contrast, Metacat allows us to ask increasingly sophisticated questions in our Feigenbaum tests. For one thing, it is able to directly compare answers to one another, summarizing its viewpoint with one of a set of canned (but non-arbitrary) English descriptions.
For example, it may report that the preferred answer “is based on a richer set of ideas,” “is more abstract,” “is more coherent,” or “involves no unjustified ideas.” It also attempts to explain how the two answers are similar to each other and how they differ. For example, consider the program’s summary of the comparison between iijjll and iijjdd in response to the aforementioned problem:

The only essential difference between the answer iijjdd and the answer iijjll to the problem “abc → abd, iijjkk → ???” is that the change from abc to abd is viewed in a more literal way for the answer iijjdd than it is in the case of iijjll. Both answers rely on seeing two strings (abc and iijjkk in both cases) as groups of the same type going in the same direction. All in all, I’d say iijjll is the better answer, since it involves seeing the change from abc to abd in a more abstract way.

For the sake of contrast, here is the program’s comparison between the answers iijjll and abd:

The only essential difference between the answer abd and the answer iijjll to the problem “abc → abd, iijjkk → ???” is that the change from abc to abd is viewed in a completely different way for the answer abd than it is in the case of iijjll. Both answers rely on seeing two strings (abc and iijjkk in both cases) as groups of the same type going in the same direction. All in all, I’d say abd is really terrible and iijjll is very good.

It should be emphasized that the specific form of the verbal output is extremely unsophisticated relative to the capabilities of the underlying architecture, indicating that it is possible to exhibit depth of insight while treating text generation almost as a side effect. This is the exact opposite of contemporary approaches to the Turing test. Apart from the thin veneer of human agency that results from Metacat’s text generation, the program’s accomplishments—and failures—are essentially transparent. (In fact, the text generation is an optional feature; it can be disabled by deselecting the “Eliza mode” option.) One can “interact” with the program in a variety of ways: by posing new problems; by supplying an answer to a problem and running the program in “justify mode,” asking it to make sense of and evaluate that answer; by having it compare two answers to one another (as in the above examples); and more.

In order for Metacat to pass an “unrestricted Feigenbaum test” in the letter-string analogy domain, what other questions might we conceivably require it to answer? Here are some suggestions:

1. Questions in which the relationships can be perceived as forming a numerical pattern: for example, aab → cccd; eeeef → ???. This kind of sequence extrapolation was extensively studied in the SeekWhence architecture (Meredith 1986), one of Copycat’s progenitors, and such a facility could readily be added to Metacat.

2. Meta-questions about sequences of answers, such as, “Why is the relationship between answer A and answer B different from that between C and D?” It is our contention that such questions could be answered using the declarative information that Metacat already has; all that would be required is the ability to pose the question.

3. Questions involving the origination of new letter-string analogy problems. Some work in this area was done in the Phaeaco Fluid Analogies architecture (Foundalis 2006), but the issue requires further investigation.

4. Questions of the form, “Why is answer A more humorous (or stranger, or more elegant, etc.) than answer B?”
For adjectives such as “humorous” that presuppose the possession of emotional or affective states, it is not at all clear what additional mechanisms might be required, though some elementary possibilities are outlined in Picard (1997).

To meet the genuine intent of the Turing test, then, we must surely be able to partake in this sort of arbitrarily detailed and subtle discourse in any domain. It seems unlikely that anyone would contend that this lies remotely within the capabilities of any of the current generation of Loebner contenders.